{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n2D Convolution Optimization\n===========================\n**Author**: `Thierry Moreau <https://homes.cs.washington.edu/~moreau/>`_\n\nThis tutorial provides an overview on how to use TVM to map a 2D convolution\nworkload efficiently on the VTA design.\nWe recommend covering the `vta-mat-mult-opt` tutorial first.\n\n2D convolution is dominant in most computer vision deep neural networks.\nIn this tutorial, we will demonstrate TVM schedule optimizations to map\n2D convolution operators in NCHW layout onto VTA.\nWe also introduce the notion of latency hiding, which allows us to\nmaximize VTA's compute and memory resource utilization.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "RPC Setup\n---------\nWe start by programming the Pynq's FPGA and building its RPC runtime.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from __future__ import absolute_import, print_function\n\nimport os\nimport tvm\nimport tvm.testing\nfrom tvm import te\nimport vta\nimport numpy as np\n\nfrom tvm import rpc\nfrom tvm.contrib import utils\nfrom vta.testing import simulator\n\n# Load VTA parameters from the 3rdparty/vta-hw/config/vta_config.json file\nenv = vta.get_env()\n\n# We read the Pynq RPC host IP address and port number from the OS environment\nhost = os.environ.get(\"VTA_RPC_HOST\", \"192.168.2.99\")\nport = int(os.environ.get(\"VTA_RPC_PORT\", \"9091\"))\n\n# We configure both the bitstream and the runtime system on the Pynq\n# to match the VTA configuration specified by the vta_config.json file.\nif env.TARGET == \"pynq\":\n\n    # Make sure that TVM was compiled with RPC=1\n    assert tvm.runtime.enabled(\"rpc\")\n    remote = rpc.connect(host, port)\n\n    # Reconfigure the JIT runtime\n    vta.reconfig_runtime(remote)\n\n    # Program the FPGA with a pre-compiled VTA bitstream.\n    # You can program the FPGA with your own custom bitstream\n    # by passing the path to the bitstream file instead of None.\n    vta.program_fpga(remote, bitstream=None)\n\n# In simulation mode, host the RPC server locally.\nelif env.TARGET in [\"sim\", \"tsim\"]:\n    remote = rpc.LocalSession()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Computation Declaration\n-----------------------\nAs a first step, we need to describe our 2D convolution computation\nin NCHW format.\n\nWe define the 2D convolution shape by the batch size,\nspatial dimensions, input channels, output channels, kernel dimensions,\nkernel dimensions, padding dimensions, and stride dimensions.\n\nWe pick the shape of the 9th convolutional layer of the ResNet-18\narchitecture as our convolution workload parameters.\n\nWe've added extra operators to the 2D convolution that apply\nshifting and clipping to the output in order to mimic a fixed-point\nconvolution followed by a rectified linear activation.\nWe describe the TVM dataflow graph of the 2D convolution layer below:\n\n![](https://raw.githubusercontent.com/uwsampl/web-data/main/vta/tutorial/conv2d_dataflow.png)\n\n     :align: center\n\nThis computation is intentionally too large to fit onto VTA's on-chip\nbuffers all at once. Therefore in the scheduling phase we'll\nrely on computation blocking strategies to break the computation down into\nmanageable chunks.\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>*Spatial padding*\n\n  Note that we'll need to import the TOPI library to apply spatial padding\n  on the input feature map tensor.\n  Spatial padding facilitates blocking in the context of 2D convolutions\n  due to the fact that the same (x, y) spatial location of the input\n  feature map of any given layer is read more than once if the convolution\n  kernel window size is greater than one.\n  On CPUs, and GPUs, one way to increase efficiency of memory accesses\n  when parallelizing work is spatial packing, which requires data re-layout.\n  VTA load DMA engine can insert padding automatically so that the original\n  input feature map does not have to be re-packed in memory.\n\n  We show the effect of VTA's on the fly spatial padding when data is being\n  loaded from DRAM into VTA's SRAM, following a 2D strided and padded memory\n  read.\n\n  .. image:: https://raw.githubusercontent.com/uwsampl/web-data/main/vta/tutorial/padding.png\n       :align: center\n       :width: 480px</p></div>\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from tvm import topi\n\n# 2D convolution layer dimensions taken from ResNet-18 architecture\n# (9th convolutional layer)\nbatch_size = 1\nheight = 14\nwidth = 14\nin_channels = 256\nout_channels = 256\nkernel_h = 3\nkernel_w = 3\npad_h = 1\npad_w = 1\nstride_h = 1\nstride_w = 1\nassert batch_size % env.BATCH == 0\nassert in_channels % env.BLOCK_IN == 0\nassert out_channels % env.BLOCK_OUT == 0\n\n# Input feature map: (N, IC, H, W, n, ic)\ndata_shape = (\n    batch_size // env.BATCH,\n    in_channels // env.BLOCK_IN,\n    height,\n    width,\n    env.BATCH,\n    env.BLOCK_IN,\n)\n# Kernel: (OC, IC, H, W, oc, ic)\nkernel_shape = (\n    out_channels // env.BLOCK_OUT,\n    in_channels // env.BLOCK_IN,\n    kernel_h,\n    kernel_w,\n    env.BLOCK_OUT,\n    env.BLOCK_IN,\n)\n# Derive output feature map dimensions\nfout_height = (height + 2 * pad_h - kernel_h) // stride_h + 1\nfout_width = (width + 2 * pad_w - kernel_w) // stride_w + 1\n# Output feature map: (N, OC, H, W, n, oc)\noutput_shape = (\n    batch_size // env.BATCH,\n    out_channels // env.BLOCK_OUT,\n    fout_height,\n    fout_width,\n    env.BATCH,\n    env.BLOCK_OUT,\n)\n\n# Convolution reduction axes\ndy = te.reduce_axis((0, kernel_h), name=\"dy\")\ndx = te.reduce_axis((0, kernel_w), name=\"dx\")\nic = te.reduce_axis((0, in_channels // env.BLOCK_IN), name=\"ic\")\nic_tns = te.reduce_axis((0, env.BLOCK_IN), name=\"ic_tns\")\n\n# Input placeholder tensors\ndata = te.placeholder(data_shape, name=\"data\", dtype=env.inp_dtype)\nkernel = te.placeholder(kernel_shape, name=\"kernel\", dtype=env.wgt_dtype)\n\n# Copy buffers:\n#   Apply spatial padding to input feature map\ndata_buf = topi.nn.pad(data, [0, 0, pad_h, pad_w, 0, 0], name=\"data_buf\")\nkernel_buf = te.compute(kernel_shape, lambda *i: kernel(*i), \"kernel_buf\")\n\n# Declare 2D convolution\nres_conv = te.compute(\n    output_shape,\n    lambda bo, co, i, j, bi, ci: te.sum(\n        data_buf[bo, ic, i * stride_h + dy, j * stride_w + dx, bi, ic_tns].astype(env.acc_dtype)\n        * kernel_buf[co, ic, dy, dx, ci, ic_tns].astype(env.acc_dtype),\n        axis=[ic, dy, dx, ic_tns],\n    ),\n    name=\"res_conv\",\n)\n\n# Add shift stage for fix-point normalization\nres_shr = te.compute(output_shape, lambda *i: res_conv(*i) >> 8, name=\"res_shr\")\n\n# Apply clipping between (0, input max value)\ninp_max = (1 << (env.INP_WIDTH - 1)) - 1\nres_max = te.compute(output_shape, lambda *i: tvm.te.max(res_shr(*i), 0), \"res_max\")\nres_min = te.compute(output_shape, lambda *i: tvm.te.min(res_max(*i), inp_max), \"res_min\")\n\n# Result Tensor\nres = te.compute(output_shape, lambda *i: res_min(*i).astype(env.inp_dtype), name=\"res\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Scheduling the Computation\n--------------------------\nWe'll look at a set of schedule transformations necessary to map the\n2D convolution onto VTA in an efficient fashion.\nThose include:\n\n- Computation blocking\n- Virtual threading to increase compute utilization\n- Lowering to VTA hardware intrinsics\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Create TVM schedule\ns = te.create_schedule(res.op)\n# Let's look at the default TVM schedule\nprint(tvm.lower(s, [data, kernel, res], simple_mode=True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Blocking the Computation\n~~~~~~~~~~~~~~~~~~~~~~~~\nThe 2D convolution is by default too large for activations or kernel weights\nto fit on VTA's on-chip buffers all at once.\nWe apply blocking along input channels, output channels, and along\nthe height spatial dimensions.\nWe don't apply blocking along the width spatial dimension since it's\nthe innermost dimension in the NCHW layout (and consequently to increase\nlocality, it's best not to block along the innermost dimension).\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Let's define tiling sizes\nb_block = 1 // env.BATCH\noc_block = 128 // env.BLOCK_OUT\nic_block = 16 // env.BLOCK_IN\nh_block = 7\nw_block = 14\n\n# Tile the output tensor along the spatial and output channel dimensions\n# (since by default we are doing single batch inference, the split along\n#  the batch dimension has no effect)\nb, oc, y, x, b_tns, oc_tns = s[res].op.axis\nb_out, b_inn = s[res].split(b, factor=b_block)\noc_out, oc_inn = s[res].split(oc, factor=oc_block)\ny_out, y_inn = s[res].split(y, factor=h_block)\nx_out, x_inn = s[res].split(x, factor=w_block)\ns[res].reorder(b_out, oc_out, y_out, x_out, b_inn, oc_inn, y_inn, x_inn, b_tns, oc_tns)\n\n# Move intermediate computation into each output compute tile\ns[res_conv].compute_at(s[res], x_out)\ns[res_shr].compute_at(s[res], x_out)\ns[res_max].compute_at(s[res], x_out)\ns[res_min].compute_at(s[res], x_out)\n\n# Apply additional loop split along reduction axis (input channel)\nb_inn, oc_inn, y_inn, x_inn, b_tns, oc_tns = s[res_conv].op.axis\nic_out, ic_inn = s[res_conv].split(ic, factor=ic_block)\n\n# Reorder axes.\n# 1) Group the VTA tensor axes in the inner most position: b_tns, oc_tns, ic_tns\n#    to allow TVM to tensorize.\n# 2) We move the ic_out axis all the way out of the convolution loop to block\n#    along the reduction axis.\n# 3) Now we re-order the block axes: b_inn, oc_inn, y_inn, x_inn, ic_inn, dy, dx.\n#    VTA runtime/hardware requires us to write to a different output feature map\n#    location for every VTA tensor operation.\n#    This restriction requires us to order one of oc_inn, y_inn or x_inn right\n#    before b_tns, since they all affect output feature map indexing.\n#    Therefore, we choose to bring x_inn inside as shown below.\ns[res_conv].reorder(ic_out, b_inn, oc_inn, y_inn, ic_inn, dy, dx, x_inn, b_tns, oc_tns, ic_tns)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Virtual Threading\n~~~~~~~~~~~~~~~~~\nVirtual threading is a mechanism that increases task-level pipeline\nparallelism in the VTA hardware design.\nPut it another way, it increases compute resource utilization by hiding\nmemory access latency.\n\nIn the implementation below, virtual threading distributes work across two\nthreads split along the output channel axis.\nWe show how work is split when computing the 2D convolution in the figure\nbelow.\n\n![](https://raw.githubusercontent.com/uwsampl/web-data/main/vta/tutorial/virtual_threading.png)\n\n     :align: center\n     :width: 480px\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# VTA only supports 2 virtual threads\nv_threads = 2\n\n# Perform virtual thread split along output channel outer axis\n_, tx = s[res].split(oc_out, factor=v_threads)\ns[res].reorder(tx, b_out)\ns[res].bind(tx, te.thread_axis(\"cthread\"))\n\n# Let's look at the current TVM schedule after blocking and virtual threading\nprint(tvm.lower(s, [data, kernel, res], simple_mode=True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Lowering Copies to DMA Transfers\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nNext we set the buffer scopes to the corresponding on-chip VTA SRAM buffers.\nWe move the load loops into the 2D convolution computation loop to stage\nmemory loads such that they fit in the on-chip SRAM buffers.\nFinally we annotate the load/store loop outer axes with the DMA copy pragma\nto perform bulk memory transfers on VTA.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Set scope of SRAM buffers\ns[data_buf].set_scope(env.inp_scope)\ns[kernel_buf].set_scope(env.wgt_scope)\ns[res_conv].set_scope(env.acc_scope)\ns[res_shr].set_scope(env.acc_scope)\ns[res_min].set_scope(env.acc_scope)\ns[res_max].set_scope(env.acc_scope)\n\n# Block data and kernel cache reads\ns[data_buf].compute_at(s[res_conv], ic_out)\ns[kernel_buf].compute_at(s[res_conv], ic_out)\n\n# Use DMA copy pragma on DRAM->SRAM operations\ns[data_buf].pragma(s[data_buf].op.axis[0], env.dma_copy)\ns[kernel_buf].pragma(s[kernel_buf].op.axis[0], env.dma_copy)\n\n# Use DMA copy pragma on SRAM->DRAM operation in each result block\n# (this implies that these copies should be performed along b_inn,\n# or result axis 4)\ns[res].pragma(s[res].op.axis[4], env.dma_copy)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Lowering Computation to VTA Compute Intrinsics\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nThe last phase is to lower the computation loops down to VTA hardware\nintrinsics by mapping the 2D convolution to tensor intrinsics,\nand mapping the shift, and clipping computation to the vector ALU.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Apply tensorization over the batch tensor tile axis\ns[res_conv].tensorize(b_tns, env.gemm)\n\n# Add an ALU pragma over the shift and clipping operations\ns[res_shr].pragma(s[res_shr].op.axis[0], env.alu)\ns[res_min].pragma(s[res_min].op.axis[0], env.alu)\ns[res_max].pragma(s[res_max].op.axis[0], env.alu)\n\n# Let's look at the final lowered TVM schedule after lowering memory\n# loads/stores down to DMA copy intrinsics, and the computation down to\n# VTA compute intrinsics.\nprint(vta.lower(s, [data, kernel, res], simple_mode=True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "TVM Compilation and Verification\n--------------------------------\nAfter specifying the schedule, we can compile it into a TVM function.\nWe save the module so we can send it over RPC.\nWe run the function and verify it against a numpy implementation to\nensure correctness.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# This library facilitates 2D convolution testing\nfrom tvm.topi.testing import conv2d_nchw_python\n\n# Compile the TVM module\nmy_conv = vta.build(s, [data, kernel, res], \"ext_dev\", env.target_host, name=\"my_conv\")\ntemp = utils.tempdir()\nmy_conv.save(temp.relpath(\"conv2d.o\"))\nremote.upload(temp.relpath(\"conv2d.o\"))\nf = remote.load_module(\"conv2d.o\")\n\n# Get the remote device context\nctx = remote.ext_dev(0)\n\n# Initialize the data and kernel arrays randomly in the int range\n# of (-128, 128] in NCHW layout\ndata_np = np.random.randint(-128, 128, size=(batch_size, in_channels, height, width)).astype(\n    data.dtype\n)\nkernel_np = np.random.randint(\n    -128, 128, size=(out_channels, in_channels, kernel_h, kernel_w)\n).astype(kernel.dtype)\n\n# Apply packing to the data and kernel arrays from a 2D NCHW\n# to a 4D NCHWnc packed layout\ndata_packed = data_np.reshape(\n    batch_size // env.BATCH, env.BATCH, in_channels // env.BLOCK_IN, env.BLOCK_IN, height, width\n).transpose((0, 2, 4, 5, 1, 3))\n\nkernel_packed = kernel_np.reshape(\n    out_channels // env.BLOCK_OUT,\n    env.BLOCK_OUT,\n    in_channels // env.BLOCK_IN,\n    env.BLOCK_IN,\n    kernel_h,\n    kernel_w,\n).transpose((0, 2, 4, 5, 1, 3))\n\n# Format the input/output arrays with tvm.nd.array to the DLPack standard\ndata_nd = tvm.nd.array(data_packed, ctx)\nkernel_nd = tvm.nd.array(kernel_packed, ctx)\nres_nd = tvm.nd.array(np.zeros(output_shape).astype(res.dtype), ctx)\n\n# Clear stats\nif env.TARGET in [\"sim\", \"tsim\"]:\n    simulator.clear_stats()\n\n# Invoke the module to perform the computation\nf(data_nd, kernel_nd, res_nd)\n\n# Verify against numpy implementation\nres_ref = conv2d_nchw_python(\n    data_np.astype(env.acc_dtype),\n    kernel_np.astype(env.acc_dtype),\n    (stride_h, stride_w),\n    (pad_h, pad_w),\n).astype(env.acc_dtype)\nres_ref = res_ref >> env.INP_WIDTH\nres_ref = np.clip(res_ref, 0, inp_max)\nres_ref = res_ref.astype(res.dtype)\nres_ref = res_ref.reshape(\n    (\n        batch_size // env.BATCH,\n        env.BATCH,\n        out_channels // env.BLOCK_OUT,\n        env.BLOCK_OUT,\n        fout_height,\n        fout_width,\n    )\n).transpose((0, 2, 4, 5, 1, 3))\ntvm.testing.assert_allclose(res_ref, res_nd.numpy())\n\n# Print stats\nif env.TARGET in [\"sim\", \"tsim\"]:\n    sim_stats = simulator.stats()\n    print(\"Execution statistics:\")\n    for k, v in sim_stats.items():\n        print(\"\\t{:<16}: {:>16}\".format(k, v))\n\nprint(\"Successful 2D convolution test!\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Summary\n-------\nThis tutorial demonstrates how TVM scheduling primitives can be used to\nlower 2D convolution onto hardware accelerator intrinsics, making\nuse of hardware specific optimizations, such as latency hiding with\nvirtual threading.\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}